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The discovery of network structures is a fundamental problem in arising in numerous fields 
of science and technology, communication systems, biology, sociology and neuroscience. Unfor- 
tunatcly, it is often difficult to obtain data that directly reveals network structure, and so one 
must infer a network from incomplete data. This paper considers inferring network structure 
from "co-occurrence" data; observations that identify which network components (e.g., switches, 
routers, genes) carry each transmission but does not indicate the order in which they handle the 
transmissions. Without order information, there is an exponential number of feasible networks 
that are compatible with the observed data. Yet, the basic physical principles underlying most 
networks strongly suggest that all feasible networks are not equally likely. In particular, net- 
work elements that co-occur in many observations are probably closely connected. We model 
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icantly simplify the problem, but the computational complexity of the reconstruction process 
does grow exponentially in the length of the longest transmission path. For large networks the 
exact E-step may be computationally intractable, and so we also propose an efficient Monte 
Carlo EM (MCEM) algorithm, based on importance sampling, and derive conditions which 
ensure convergence of the algorithm with high probability. Remarkably, the MCEM maintains 
the desirable properties of the exact EM algorithm and reduces the complexity of each iteration 



graph, subjected to a random permutation which accounts for the lack of order information. 
Treating the permutations as missing data, we derive an exact expectation-maximization (EM) 
algorithm for estimating the random walk parameters. The model and EM algorithm signif- 
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to polynomial-time. Simulations and experiments with Internet measurements demonstrate the 
promise of this approach. 

1 Network Inference and Co-Occurrence Observations 

The study of complex networked systems is an emerging field, impacting nearly every area of engi- 
neering and science, including the important domains of communication systems, biology, sociology, 
and cognitive science. The analysis of communication networks enables a better understanding of 
routing, transmission patterns, and information flow [6,21]. The analysis of biological networks 
provides insight into the functional roles played by different genes, proteins, and metabolites in 
biological systems [13,19]. The analysis of social networks contributes to a deeper understanding 
of interactions, dynamics, and the structure of organizations [18,25]. The analysis of functional 
connectivity networks in the brain is necessary for the understanding of couplings and interactions 
between different neuronal colonies [1,23,24]. Obtaining or inferring the structure of networks from 
experimental data precedes any such analysis and is thus a basic and fundamental task, critical to 
many applications. 

Unfortunately, measurements which directly reveal network structure are often beyond experi- 
mental capabilities or are excessively expensive. This paper considers inferring network structure 
from observations that identify which network components (e.g., switches, routers, genes) carry 
each transmission but does not indicate the order in which they handle the transmissions. Mathe- 
matically, the underlying network structure can be represented as a directed graph, and the vertices 
involved in each transmission form a connected subgraph. The observations only reflect which sub- 
set of vertices are involved or "occur" in each transmission; not their inter-connectivity. We refer 
to such observations as co-occurrences. Co-occurrence observations arise naturally in each of the 
application areas mentioned above. 

Transmissions over telecommunication networks are carried by links and routers / switches which 
form a path between the source and terminal nodes. In some cases, it is impossible to directly ob- 
serve the order in which the routers/switches handle each transmission, since sensors are geographi- 
cally distributed, making precise time-synchronization impractical. The so-called internally-sensed 
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network tomography problem specifically aims at recovering network structure from unordered lists 
of network elements along transmission paths [21]. 

Biological signal transduction networks describe fundamental cell functions such as growth, 
metabolism, differentiation, and apoptosis (disintegration) [19]. Although it is possible to test for 
individual, localized interactions between genes pairs, such experiments are expensive and time- 
consuming. High-throughput measurement techniques such as microarrays have successfully been 
used to identify the components of different signal transduction pathways [28]. However, microarray 
data only reflects order information at a very coarse, unreliable level. Developing computational 
techniques for inferring pathway orders is an active research area [16]. 

Co-occurrence or transactional data also appears in the context of social networks, e.g., by 
considering which academic papers are co-cited by another paper, which web pages are linked to 
or from another web page, or which people were diagnosed with a common disease on the same 
day. Such measurements are readily available, but do not necessarily reflect the temporal or other 
natural order of occurrence. Researchers in this area have considered the problems of reconstructing 
networks from co-occurrence data and of using the inferred network to predict potential future co- 
occurrences [14]. 

Functional magnetic resonance imaging (fMRI) provides a mechanism for measuring activity in 
the brain with high spatial resolution. By observing which regions of the brain co-activate while a 
patient is performing different tasks we can obtain multiple co-occurrence observations. Although 
fMRI offers high spatial resolution it has limited temporal resolution, making it impractical to 
obtain complete order information. Magnetoencephalography and electroencephalography measure 
activity in the brain with higher temporal resolution but only provide coarse spatial resolution, and 
thus may not allow one to determine precisely which functional regions are active during a given 
task. Existing techniques for obtaining functional co-activation networks either involve brute-force 
measurement or use crude correlation methods (see [24] and references therein). 

In this article we focus on observations arising from transmissions in a network. Specifically, 

each co-occurrence observation corresponds to a path 1 through the network. We observe the vertices 

throughout this paper a "path" refers to a sequence of vertices (xi,X2, ... ,xn) such that there is an edge between 
each adjacent pair of vertices, Xi-i and Xi, and no node appears more than once in the sequence. 
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comprising each path but not the order in which they appear along the path. In certain applications 
the endpoints (source and destination) of the path may also be observed. 

Our goal is to identify which pairs of vertices are directly connected via an edge, thereby learning 
the structure of the network. A feasible graph is one which agrees with the observations; i.e., a 
graph which contains a directed path through the vertices in each co-occurrence observation. Given 
a collection of co-occurrence observations a feasible graph is easily constructed by assigning an order 
- any order, in fact - to the vertices in each observation, and then inserting directed edges between 
vertices which are adjacent in the assigned order. In light of the many possible orders for each 
co-occurrence observation, the number of feasible topologies grows exponentially in the number 
and size of observations. Without additional assumptions, side information, or prior knowledge, 
there is no reason to prefer one feasible topology over the others. 

Previous work on related problems has involved heuristics using frequencies of co-occurrence 
either to assign an order to each path [21] or to approximate the probability of transitioning from 
one vertex to another [14]. These approaches make stringent assumptions and sacrifice flexibility 
in order to achieve computational tractability and systematically identify a unique solution. The 
frequency method introduced in [21] is based on a model where paths from a particular source 
or to a particular destination form a tree. This model coincides with the shortest-path routing 
policy. When the network provides multiple paths between the same pair of endpoints (e.g., for 
load-balancing) the algorithm may fail. The cGraph algorithm of Kubica et al. [14] inserts weighted 
edges between every pair of vertices which co-occur in some observation. This approach produces 
solutions which are typically much denser than desired. Because both of these methods are based on 
heuristics, the results they produce are not easily interpreted. Also, these heuristics do not readily 
lend themselves to incorporating side information. A different approach, introduced by Justice and 
Hero in [12], involves averaging over an ensemble of feasible topologies sampled uniformly from 
the feasible set. In general there is an enormous number of feasible topologies (exponential in 
the problem dimensions) exhibiting a wide variety of characteristics, and it is not clear that an 
average of feasible topologies will be optimal in any sense. These observations have collectively 
motivated our development of a more general approach to network reconstruction which we simply 
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term network inference from co-occurrences, or NICO for short. 

Our approach is based on a generative model where paths are realizations of a random walk 
on the underlying graph. A co-occurrence observation is obtained by randomly shuffling each 
path to account for our lack of observed order information. Based on this model, network inference 
reduces to estimating the parameters governing the random walk. Then, these parameter estimates 
determine the most likely order for each co-occurrence. 

The following interpretation motivates our shuffled random walk model. Imagine sitting at a 
particular vertex in the network and observing a series of transmissions pass by. This vertex is 
only connected to a handful of other vertices in the network, so regardless of its final destination, a 
transmission arriving at this vertex must pass through one of the neighboring vertices next. Over a 
period of time, we could record how many arriving transmissions are passed to each neighbor, and 
then calculate the empirical probabilities of transmissions to each of the neighbors. Obtaining such 
probabilities at each vertex would provide a tremendous amount of information about the network, 
but unfortunately co-occurrence observations do not directly reveal them and we therefore face a 
challenging inverse problem. This paper develops a formal framework for estimating local transi- 
tion probabilities from a collection of co-occurrence observations, without making any additional 
assumptions about routing behavior or properties of the underlying network structure. Experi- 
mental results on simulated topologies indicate that good performance is obtained for a variety of 
operating conditions. 

It is also worth mentioning that the approach discussed in this paper differs considerably from 
that of learning the structure of a directed graphical model or Bayesian network, a graph where 
nodes correspond to random variables and edges indicate conditional independence relationships 
[8,9]. A typical aim of graphical modelling is to find a graph corresponding to a factorization of 
a high-dimensional distribution which predicts the observations well. In turn, these probabilistic 
models do not directly reflect physical structures, and applying such an approach in the context 
of this problem would ignore physical constraints inherent to the observations: that co-occurring 
vertices must lie along a path in the network. 

The rest of the paper is organized as follows. In Section[2]we introduce notation and formulate 
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the problem setup. Section |3] reviews the standard approach to estimating the parameters of a 
random walk when fully observed (ordered) samples are available and presents an EM algorithm 
for estimating random walk parameters from shuffled observations. A Monte Carlo variant of the 
EM algorithm is described in Section 21 for situations where long transmission paths make the exact 
E-step computation prohibitive. Section \5\ analyzes convergence of the Monte Carlo EM algorithm. 
Simulation results are presented in Section Eland the paper is concluded in Section [71 where ongoing 
work is also briefly described. 

2 Problem Formulation 

We model the network as a simple directed graph G = (S,E), where S = {1,2, ... , \S\} is the 
set of vertices/nodes and E C S x S is the set of edges. The number of nodes, is considered 
known, so network inference amounts to determining the adjacency structure of the graph; that is, 
identifying whether or not E E, for every pair of vertices E S x S. 

A co-occurrence observation, y C S, is a subset of vertices in the graph which simultaneously 
"occur" when a particular stimulus is presented to the network. For example, when a transmission 
is made over a communication network, a subset of routers and switches carry the transmission from 
the source to the destination. This activated subset corresponds to a co-occurrence observation, 
with the stimulus being a transmission between that particular source-destination pair. By repeat- 
ing this procedure T times with different stimuli we obtain observations, y = {y^,y' 2 ', • • • ,y^}, 
where y( m ) = (yj m , y^ , . . . , y^ 1 ) is a length- N m co-occurrence, indexed in an arbitrary order. 

A directed graph G = (S, E) is said to be feasible with respect to observations y if for each co- 
occurrence y( m ) G y there exists an ordered path z( m ) = (z^, z!^\ . . . , z^) and a permutation 
T ( m ) = (t[ , . . . ,t^) such that z\ m ^ = y ty\ for each t, and there is an edge from zt—\ to z t in 
the graph for t = 2, . . . , N m , that is, (zt-i, z t ) E E. 

Notice that if we observed ordered paths then network inference would be trivial. We would 
begin with an empty graph Go = (S,E) with E = {}. Then, for each ordered observation we 
would update the set of edges via E <- E U (45 , 4 m) ) for t = 2 > • • • , Nm- Similarly, if we observed 
the correct permutation r^" 1 ) along with each co-occurrence y^ m \ we could invert the permutation 



6 



to recover ordered observations and apply the same procedure. 

In practice we do not make ordered observations nor do we have access to the correct permu- 
tations. However, we can obtain a feasible reconstruction by associating any permutation (of the 
appropriate length) with each co-occurrence, and then following the procedure described above. 
There are N m \ ways to permute the elements of y( m ), so there may be as many as \\^n = iN m \ 
feasible reconstructions. Clearly, for large N m and T this is a huge set to search over. Moreover, 
without making additional assumptions, or adopting some additional criteria, there is no reason to 
prefer one feasible reconstruction over another. 

Physical principles governing the development of many natural and man-made networks suggest 
that not all feasible networks are equally plausible. Intuitively, if two or more vertices appear 
collectively in multiple co-occurrences, we expect that their order is probably the same in the 
corresponding paths. Likewise, we expect that most vertices will only be directly connected to a 
small fraction of the other vertices. Based on this intuition we propose the following probabilistic 
model. First, we model the unobserved, ordered paths, z*" 1 ), as independent samples of a first- 
order Markov chain. The Markov chain is parameterized by an initial state distribution it 6 [0, 
where 7Tj = P\z\ = i], and a probability transition matrix, A G [0, l]' 5 ' x ' 5 ', where Aij = P[z t = 
j\z t -\ = i\. Of course, these parameters must satisfy the normalization constraints 

\S\ \s\ 

y~] 7Tj = 1 and ^2 = 1, for each i = 1, . . . , \S\. (1) 

i=l j=l 

In addition, we assume that the support of the transition matrix is determined by the adjacency 
structure of the underlying network; i.e., Aij > if and only if G E. 

A co-occurrence observation, y, is generated by shuffling the elements of an ordered Markov 
chain sample, z = (zi, . . . , zjy), via a permutation r drawn uniformly from the collection of 
all permutations of iV elements. Thus, for each t = 1,...,N, zt = y n . We assume that r is 
independent of the Markov chain sample, z. Based on this model, we can write the likelihood of a 
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co-occurrence observation y conditioned on the permutation r as 



A' 



P[y|r,A,7r] = tt^ JJ A Vrt _ x jVt 



(2) 



t=2 



Since P[t] = 1/(N\), for any r G Vl/jv, marginalization over all permutations leads to 



P[ y |A,7r] = ^ ^ P[y|r,A,7r]. 



(3) 



re*, 



Finally, assuming that co-occurrence observations are independent, and taking the logarithm, gives 



logP[y|A,7T] = 



m=l 



log) Yl P[y {m) \T (m) ,A,7v] ] -log(iV m !) 



(4) 



Under this model, network inference consists in computing the maximum likelihood (ML) estimates, 



(Aml,7?ml) = argmax logP[J^| A, tt]. 

A,7T 



(5) 



With the ML estimates in hand, we may determine the most likely permutation for each co- 
occurrence observation according to (Aml,ttml), and obtain a feasible reconstruction using our 
procedure for ordered observations described above. 

For non-trivial observations, log P[y\ A, ir] is a complicated, non-concave function of (A, tt), so 
solving © is not a simple task. In the next section, we derive a EM algorithm for solving this 
optimization problem, by treating the set of permutations, T = {t^\ t^ t '}, shuffling the paths, 
as missing data. 
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3 An EM Algorithm for Estimating Markov Chain Parameters 
from Shuffled Observations 



3.1 Fully Observed Markov Chains: Notation and Estimation 

Let Z = {zW, z( T )} be a set of sample paths, z( m ) = (z^, zffl), independently generated 
by a Markov chain with transition matrix A and initial state distribution tt (see Q). For later use, 
it is convenient to introduce the equivalent binary representation w^" 1 ) = (w 1 ^, ...,wjy ), for each 
sample z^ m \ defined as follows: w| m ' = (w[™\ . . . , w^?l ) € {0, l}' 5 !, with {w'ff 1 = 1) 44> = i); 
of course, one and only one entry of each vector w| m ' equals 1. Finally, let W = {w*- 1 -*, w^ T ^}, 
which contains the exact same information as Z. With this notation, we can write 

T \S\ T N m \S\ \S\ 

m=l i=l m=l t=2 i=l j=X 

\S\ T \S\ \S\ T N m 

= e log vr, e «#> + E E lo § ^ E E «£t -& } • 

i=l m=l i=l 3=1 m=l t=2 

Maximum likelihood estimates of 7r and A can be obtained from W my maximizing log P[W| A, 7r] 
under the constraints in ([1)1: the solution is well known, 



T Nr> 

EE«1 



(m) (m) 



A,:.,- = m=1 1=2 , and % = — E ™i™ ■ ( 6 ) 



tJ |S| t iv m ' ' T 

(m) (m) m=1 
°t-l,< <j 

j=l m=l t=2 



EEE "' ; 



3.2 Shufflings, Permutations, and the EM Algorithm 

To address the case where we have a set of co-occurrences y = {y^> ■ ■ • , y^} ; not ordered samples, 
we defined the equivalent binary representation X = {x^ 1 ) , . . . , x^} for y in a similar way as above: 

xH = (x^ 5 ...,x£), where ^ = 0$°, * {°> with (4f = 1) «> (% M = 0- 

Rather than using r^ m ^ = (r} m , . . . , rjy ) to denote the mth permutation/shuffling, we in- 
troduce a more convenient (binary) representation; each shuffling is represented by a permutation 
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matrix*, which we will term shuffling matrix. Let the shuffling matrix for sequence m be denoted 
as r(' m ) so that (rj™ } = 1) & (r t = if) <fr (x^ ) = wj m) ) (yj, m) = z^). Given both r ( m ) and 
x ( m ) ) we recover the unshuffled sequence w( m ) by applying (using 0° = 1) 



N m , _ _(m) 

(m) _ 77 ( Jm)Yt,t> 
J t,i ~ 

t'=l 



Let 7?. = {r^,...,r^} be the collection of shuffling matrices that allow recovering the un- 
derlying ordered paths W = {w^ 1 ), w( T >} from the corresponding shuffled co-occurrences X = 
{xW, ...,x.^}. We can write the complete log-likelihood log-P[Vt,7£|A, iv] as follows: 

]ogP[X,n\A,ir] = \ogP[X\H, A,tt] + logp[K} (8) 

T 

= ^logP[x( m )|r( m ),A,7r]+logp[^] 

m=l 

T N m N m N m \S\ 

= E E E E E # } 41- 4? 

m=l t=2 t'=l t"=l i,j=l 
T N m \S\ 

+ E E E r W 4? logTi + i°gp[n (9) 

m=l t'=l i=l 

where p[7£] is the probability of the set of shufflings TZ, which we assume constant. 

To estimate A and 7r from X, we treat 1Z as missing data, opening the door to the use of the 
EM algorithm. Notice that if we had the complete data (X,TZ), we could recover W via Q and 
obtain the closed- form estimates (jHJ). The EM algorithm proceeds by computing the expected value 
of log P[X, TZ\ A, 7r] (w.r.t. 7Z), conditioned on the observations and on the current model estimate 
(A fc ,7r fc ) (the E-step), 

Q(A,7v;A k ,7v k ) = E\logP[X,TZ\A,7v] X,A k ,w k . (10) 



2 A matrix with one and only one "1" in each row and each column. 
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The model parameter estimates are then updated as follows (the M-step): 



(a* +1 , 7T fc+1 J = argmax Q ^A, it; A fc , ir l 



(11) 



These two steps are repeated cyclically until a convergence criterion is met. 

3.3 The E-step 
3.3.1 Sufficient statistics 

Rearranging ©, and dropping logP[7£] (assumed constant), we can write 

T \S\ N m N m 

m=l i,j=l t',t"=l t=2 
T \S\ N m 

m=l i=l t'=l 

revealing that log P[X, 1Z\ A, 7r] is linear with respect to simple functions of 1Z: 
• the first row of each r( m ): r^??, for m = 1, . . . , T and i' = 1, . . . , N m ; 



(12) 



sums of transition indicators: a^™l = Ylt^^tt'' r t-lt"> ^ or m = li---)^' an( l ^'>* 



l,...,iV m . 



Since expectations commute with linear functions, the E-step reduces to computing the conditional 
expectations of r^) and ajiTjl (denoted f^?? and respectively) and plugging them into the 

complete log-likelihood. Noticing that the rY?/ and aj™// are binary (in {0, 1}), yields 



_(m) 

w 



_(m) 
a t',t" 



„( m ) 
l,t' 



(m) 



^,A fc ,7T fe 



P 
P 



(m) 



^,A fc ,7T fc 



A\ A* 7T* 



(13) 

(14) 



Finally, Q (A, 7r; A fc , 7r fc ) is obtained simply by plugging r[™, and a|™l in the places of r[™, and 
r t_i,t"i respectively, in (JT2J). 
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3.3.2 Computing 

Since the permutations are (a priori) equiprobable, we have P[r] = l/(N m \) (for r G ^N m ), 
P[r { ™) = 1] = ((JV m - l)!/iV m !) = l/N m , and P[r\r[™) = 1] = l/((iV m - 1)!). Using these 
facts, together with the mutual independence among the several sequences, and Bayes law, yields 



P 



1 



x (m) )A fc )7r fe 



p[ x (m)| r M = i, A k ,7r k ] P[r^J = 1] 



(m) 



r€*iv m : r l t /=l 



P[ x M|A fe ,7T fc ] 

P[x( m )|r,A fe ,7r fe ] 



P[x (m) \r, A k ,7r k 



(15) 



where each term P[x( m ) |r, A k , 7r fc ] is easily computed after using r to unshume x 



v„ 



P[x( m )|r,A fc ,7r fc ] =P[y( m )|r,A fc ,7r A; ] =7r fc (m )II A * 



, , (m) (m) • 



Denoting the numerator of (|15|) as we have a more compact expression 



_(m) _ ^t- 
1.*' ~ iV„ 



(m) 



t'=i 



(m) 
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3.3.3 Computing oi™i 

The computation of a^ 1 ),, follows a similar path as that of f^?, ; since all permutations are equiprob- 
able, P[rlfr^l tll = 1] = (N m - 2)!/(iV m !) and P[r\r<$ r^,, = 1] = l/((iV m - 2)!), thus 



t=2 



p\ (m)| „M „M _i A fc fc] P r r M (m) _ -,-1 
r L x l r M' r t-i,t" — ■ L > A J ^L r t,t' r t-l,t" — L l 



t=2 



P[x( m )|A fc ,7T 



J] P[x( m )|r,A fc ,7r 



Mir A'^l W ~ 2 ^ ! 



^ AT! E ^[x (m) |r,A^] 



N m \ 

N, 



m ' ,-,T, 



^ P[x( m )|r,A fc ,7r fc ]^r M ,r t _ lit » 
P[ X ( m )|r,A fc ,7r fc ] 

Denoting the numerator of ((To]) as 7^1, we finally have 



(m) 

_(m) 7t',t" 



iVm 

V-Y (m) 

2^ 7 *',t" 

t'=l 



(16) 



(17) 



For the mth observation, the statistics {r^}} and {ai™)^,} have an 0(N^ l ) memory cost (iV^ — 
N m transition statistics and N m initial state statistics). These quantities can be computed via the 
summary statistics {7 t / } and {7^//}, using the same memory needed to store {r^)} and {a tt t>,t"}, 
in 0(iV m !) operations (the number of all permutations for a length N m observation). For large 
N m , this is a heavy load; Section 0] describes a sampling approach for computing approximations 
to f\ f and a t > t" ■ 
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3.4 The M-step 



(m) 



Recall that the function Q (A, 7r; A k , 7r fc ) is obtained by plugging f\t> and cif f in the places of r\ 
and X^t^2 r t't' r t™lt"> respectively, in (|12|). Maximization w.r.t. A and 7r, under the constraints 
in leads to following simple update equations: 

T N m T N m 

^( m ) »M (m) \ ^ \ -> _(m) (m) 

fc + 1 = m=ltf,f=l fc+1 = m= l g=l (18) 

V V V Am) Am) (m) \ ^ \^ \^ =(m) Am) 

/ j / j / j a t l ,t" JL t" ) i x t',j / j / j / j '1,1/ ^t'.i 

j=l m=l t',t"=l i=ltn=l t'=l 

3.5 Handling Known Endpoints 

In some applications, (one or both of) the endpoints of each path are known and only the internal 
nodes are shuffled. This is the case in communication networks (i.e., internally-sensed network 
tomography), since the sources and destinations are known, but not the connectivity within the 
network. In estimation of biological networks (signal transduction pathways), a physical stimulus 
(e.g., hypotonic shock) causes a sequence of protein interactions, resulting in another observable 
physical response (e.g., a change in cell wall structure); in this case, the stimulus and response act 
as fixed endpoints, our goal is to infer the order of the sequence of protein interactions. 
Observe that knowledge of the endpoints of each path imposes the constraints 



(m) -. , (m) 

"l l = 1 and r' N N 



Under the first constraint, estimates of the initial state probabilities are simply given by 

i T 

m=l 

Thus, EM only needs to be used to estimate the transition matrix entries. Let 

= {r e $> N ■ n,i = 1, r N;N = 1}, 



771=1 
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denote the set of permutations of N elements with fixed endpoints. As in the general case, the 
E-step can be computed using summary statistics (for t' , t" = 1, . . . , N m ) 

and setting a^„ = 7^1/7- The M-step (update for A fc+1 ) remains unchanged. 
3.6 Incorporating Prior Information 

The EM algorithm can be easily modified to incorporate conjugate priors; these are Dirichlet priors 
for 7r and for each row of A, 

\s\ \s\ \s\ 

P[tv\u] ocn< 1_1 and P[A\v] oc ]J J] A^f\ (19) 

i=l i=l j=l 

where the parameters Uj and Vij should be non-negative in order to have proper priors [2]. The 
larger U{ is relative to the other itj', i' ^ i, the greater our prior belief that state i is an initial state 
rather than the others. Similarly, the larger Vij relative to other Viji for f ^ j, the more likely we 
expect, a priori, transitions from state i to state j relative to transitions from i to the other states. 

Adding the logarithms of the priors in ()19|) to the complete log- likelihood ©, we find that 
incorporating priors into the EM algorithm only results in a change to the M-step. Consider the 
prior distribution on the initial state distribution; taking m = c > 1, for all i, has a smoothing 
effect, encouraging all of the states to have some mass in the initial state distribution. On the other 
hand, with < c < 1 will have a shrinkage effect, encouraging all of the mass to go to one (or a 
few) of the states. We can push even more aggressively for a sparse solution by choosing negative 
parameters for the Dirichlet distributions (which will become improper), as done in [7] for Gaussian 
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mixtures. When negative Dirichlet parameters are allowed, the M-step updates become 

/ T N m \ 
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where (a)+ = max{0, a} is the so-called positive part operator. 

4 Monte Carlo E-Step by Importance Sampling 

For long sequences, the combinatorial nature of Q15j) and ()16|) (involving sums over all permutations 
of each sequence) may render exact computation impractical. In this section, we consider Monte 
Carlo approximate versions of the E-step, which avoid the combinatorial nature of its exact version. 
The Monte Carlo EM (MCEM) algorithm, based on an MC version of the E-step, was originally 
proposed in [26], and used ever since by many authors (recent work can be found in [3, 11] and 
references therein). 

To lighten the notation in this section, we drop the superscripts from (A k , ir k ), using simply 
(A, 7r) as the current parameter estimates. Moreover, we focus on a particular length- iV path 
y = {yi, . . . , yj\r} C S N and drop the superscript (m); due to the independence of the paths, 
there's no loss of generality. Recall that y is a (shuffled) path, thus has no repeated elements. 

The E-step (see (jlHjl and (|14jl) consists of computing the conditional expectations f\ t' = 
E [ri t t> x, A, 7r] and ctt> t t" = E [at',t" | x i A, 7r] . A naive Monte Carlo approximation would be based 
on random permutations, sampled from the uniform distribution over ^jy. However, the reason to 
resort to approximation techniques in the first place is that is large, with only a small fraction 
of these random permutations having non-negligible posterior probability, P[r|x, A, 7r]; a very large 
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number of uniform samples is thus needed to obtain a good approximation to fif and atf t"- 

Ideally, we would sample permutations directly from the posterior P[r|x, A,7r]; however, this 
would require determining its value for all iV! permutations. Instead, we employ importance sam- 
pling (IS) (see, e.g., [15,22], for an introduction to IS): we sample L permutations, r 1 ,...,r i , 
from a distribution P[r|x, A,7r], from which it is easier to sample than P[r|x, A,7r], then apply 
a corrective re- weighting to obtain approximations to f\ # and a ( / 1 « (note that we are now using 
superscripts on r to index sample numbers, not to identify paths). The IS estimates are given by 
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where Zi, the correction factor (or weight) for sample r*, is given by 

_ P[r*|x,A, ff ] 
l ~ P[^|x,A,7r]' l24j 

the ratio between the desired distribution and the sampling distribution employed. 

A relevant observation is that the target and sampling distributions only need to be known 
up to normalizing factors. Given R'[r] = Zr R[r\x, A, ir] and P'[r|x, A,7r] = ZpP[r\x., A,7r], for 
constants Zr and Zp, we can use 

_ PVjx^A^TT] _ Z P 

^ - P'[r*|x,A,7r] " Y~ R Z *> (25) 

instead of Zi in (|22j) and (|23|) ; these sums will remain unchanged since the factor Zp / Zp will appear 
both in the numerator and denominator, thus cancelling out. 

The remainder of this section contains the description of an IS scheme, including the derivation 
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of closed form expressions for both the sampling distribution, R, and the sample weights, Zj. We 
conclude the section by mentioning other sampling variants and presenting experimental results. 

4.1 Sampling Scheme 

Let f = {/i,...,/|5|}e{0,l}l s 'lbea sequence of binary flags. Given some probability distribution 
P = {PiiP2, • • • :P\s\} on the set of states, S, denote by p • f the restriction of p to those elements 
of S that have corresponding flag fa set to 1, that is, 

(p-f)i= {s Plft , fori = l,2,...,|S|. (26) 

J2pj fj 

j'=i 

The proposed sampling scheme is defined as follows: 

Step 1: Let f = {/i, . . . , f\s\} be initialized according to fi = I{ iey y, where Iq denotes the indi- 
cator function. 

Obtain one sample from S according to the distribution 7v ■ f . Let the obtained sample be 
denoted s; of course, one and only one element of y is equal to s. 

Locate the position t of s in y; that is, find t such that yt = s. Set t\ = t. 

Set f s = (preventing y t from being sampled again). Set i = 2. 

Step 2: Let p = {Ag,i, . . . , ^4 Sj |5|} be the sth row of the transition matrix. 
Obtain a sample s' from S, according to the distribution p • f . 
Find t such that yt = s' . Set n = t. Set f s > = 0. 

Step 3: If i < N, then set s <— s', set % <— % + 1, go back to Step 2; otherwise, stop. 
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4.1.1 Sampling Distribution 

Before deriving the form of the distribution R, let us begin by writing the target distribution 
P[r|y, A, 7r] explicitly. Using Bayes law, 

P[r|y,A, W ]- p[y|A)7r] , (27) 

since r does not depend a priori on A or 7v. Based on our assumption that all permutations 
are equiprobable we have P[t] = I^ Te q, N y/N\. Noticing that the denominator in ((2*7|) is just a 
normalizing constant independent of r, we have 



P[r|y,A,7r] oc I{re* N } P[y\r,A,iv} = I{ T e* N } ]1 A/r^^l 



(28) 



For the sake of notational economy, we will write simply R[t] to represent i?[r|y, A, 7r]. The 
sequential nature of the sampling scheme suggests a factorization of the form 

R[t] =R[tj] R[t 2 \tx] R[t2\t 2 ,tx] ••• R[t n \t N -i,...,ti]. (29) 

For Step 1 of the sampling scheme, it's clear that, for t\ = 1, . . . ,N, 

R[Tl]<XTTy T1 . (30) 

For the i-th iteration, we have, 

R[Ti\Ti_i,.. .,n] = A yT ._^y T . ^(rj_i,.. . ,ti) I{ n f{ n _ u ..., T1 }}, (31) 

with 



-i 



i t^{Ti_l,...,Tl} 



19 



Inserting (J3(J|) and (|31j) into ()29|) . we finally have 
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iV 



1 ri 



-R[r] oc 7r, 



n^( 



'•••' ri ) n^^f 



Ti_l,...,Tl}} 



(32) 



. Li=2 



. Li=2 



observe that the third factor in the r.h.s. of (|32j) is simply the indicator that r is a permutation, 
i.e., is equal to I^ Te ^ N y, for any r G {1, N} N . 

Dividing 1)28(1 by (|32|) we obtain the correction factor z for a permutation sample r generated 
using this sequential scheme as 



With this quantity in hand, we have all the ingredients needed to produce IS estimates of f\ y an d 
Notice that computing the terms (pi, thus computing z, is easy since these factors are the 
normalization terms for the distributions p • f , which are already computed while performing each 
iteration of Step 2. Thus, we just need to store the product of these normalizing constants to finally 
obtain the weight z. 

4.1.2 Known Endpoints 

In the case where the endpoints are known, we fix r% = 1, rjv = N, and set fi = and = in 
Step 1; the remainder of the procedure is unchanged. Based on these constraints, the importance 
sampling weight takes a slightly different form: 

AT-l 

i=2 ti{r i - 1 ,...,T 1 } 

4.2 Hierarchical Sampling Schemes 

In addition to the sampling scheme that we have just described, we have also developed other 
sampling schemes that work in a hierarchical, rather than sequential, fashion. For the sake of space, 
we refrain from describing these other sampling schemes; detailed descriptions can be found in [20]. 




(33) 
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In particular, we have developed a two-stage hierarchical scheme and a fully hierarchical scheme. 
In the two-stage method, the first stage samples from the collection of all possible transitions 
occurring in a path; then the second stage samples from the distribution on all arrangements of 
these transitions, to form a permutation. In the fully hierarchical method, the first stage samples 
a suitable set of transitions, say Qi, then, the following stage samples a suitable collection of pairs 
of elements of Qi, yielding a collection of quadruples, Q2, and the procedure is repeated until a 
permutation is obtained. 

4.3 Performance Assessment 

A standard error metric for comparing two distributions P and P taking values on the finite set 
is the i\ distance, defined as 



\p-p\\i = | p M - p[ t ] 

rG*jv 



(35) 



Given a set of permutations, {r 1 , . . . , r L }, drawn from the sampling distribution R along with the 
corresponding weights, z\, . . . ,zl, we can compute the empirical distribution Pr induced on 
according to 

P R [r] = (jrz i ) Y^Zil^y (36) 



vi=l 



i=l 



Notice that the Monte Carlo sufficient statistics afy, and r^J are just sums of terms Pr[t], e.g., 



«f,f = Ere** p rH £t=l r *-i,*'' 7 v- Thus > 



_(m) ^(m) 
a t',t" ~ a t't" 



< 



P-Pr 



and 



'i,t' 'i,t' 



< 



P-Pr 



showing that if the t\ error between the true distribution on permutations and the empirical 
importance sampling distribution is small, then all of the estimated sufficient statistics will be close 
to the corresponding exact value. 

We have evaluated the performance of various sampling schemes via simulation, considering 
three scenarios concerning the distribution over all permutations: 1) roughly uniform, 2) moderately 
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peaked, and 3) highly concentrated around just a few permutations. The scenario 1) is typical of 
the first EM iterations; scenario 2) is typical during intermediate EM iterations; the third scenario 
is typical when EM is near convergence. We consider a length-8 path with known endpoints, thus 
with 6! = 720 possible orderings. This length is long enough to suggest how each sampling scheme 
will perform for longer paths, while still allowing enumeration of all orderings. 

Figure n depicts the i\ error between the true and IS-based distributions on permutations, as a 
function of the number of samples. The curve labelled "True Dist" corresponds to sampling from 
the true distribution, shown as a reference, is only possible when we can enumerate all permu- 
tations. The "Causal IS" curve corresponds to the scheme described in Section 14.11 The "Two 
Stage" and "Hierarchical" curves depict the performance for the hierarchical schemes mentioned in 
Section f4. 21 Finally, "Random" refers to an approach where we sample from a uniform distribution 
on permutations, shown as a baseline comparison. Each curve was generated by averaging over 
50 Monte Carlo simulations. These curves depict performance using up to 500 samples for a path 
with 720 possible orderings, which is actually quite a generous helping of samples. In experiments 
with Internet data we have encountered paths of up to length 27, and observed good reconstruction 
performance using as few as 2000 samples (notice that 27! ~ 10 28 ). Thus, performance for very few 
samples is of great interest. As expected, all of the sampling schemes give the same performance 
when distribution is roughly uniform. However, as the distribution becomes more concentrated, 
there is a clear difference between the various sampling schemes. The uniform sampling scheme 
naturally performs the worst on more concentrated distributions. Of the schemes that are practical 
for long paths, these results indicate that the causal sampling scheme performs the best. 

In terms of computational complexity, the causal sampler is the simplest and fastest, requiring 
only O(N) operations per sample (N is the path length). The two-stage sampler requires 0((A r /2)!) 
operations per sample, while the fully hierarchical has computational complexity 0(N 2 log N) op- 
erations per sample. The conclusion is that the causal sampling method described above is simple 
to implement, fast, and it empirically outperforms more computationally complex schemes. 

To compare the efficacy of each sampling scheme within the EM algorithm, we generated a 
random network with 250 nodes and simulated 60 random sample paths through this network, 
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Figure 1: t\ error as a function of the number of importance samples drawn for various sampling 
schemes in the following scenarios: (a) a roughly uniform distribution on the permutations, (b) 
moderately peaked distribution, (c) highly concentrated distribution. The curves in each figure 
were calculated by averaging over 50 Monte Carlo simulations. 



ranging in length from 4 to 10 hops. Then, we estimated a probability transition matrix for the 
network using the EM algorithm with different IS-based E-steps (with known endpoints for each 
path) . Figure [2 depicts the marginal log-likelihood of the data, computed according to Q using 
the probability transition matrices returned by the EM algorithm, for a number of samples-per- 
path between 20 and 100. The horizontal dashed line at the top marks the marginal log likelihood 
computed using a transition matrix estimated from correctly ordered paths. 



5 Monotonicity and Convergence 

Well-known convergence results due to Wu and Boyles [4, 27] guarantee convergence of our EM 
algorithm when the exact E-step is used. Let 6 k = (A k , 7r fc ) denote parameter estimates calculated 
at the kth EM iteration using the exact EM expressions. By choosing k+l = (A k+1 , 7r k+1 ) 
according to (jllj) in the M-step, our iterates satisfy the monotonicity property: 

Q(> +1 ;0 fc ) > Q(e k ;G k y (37) 

The marginal log-likelihood (j3J is continuous in its parameters A and 7r and it is bounded above. In 
this setting the monotonicity property (|37ft guarantees that each exact EM update monotonically 
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Figure 2: Using various approximate E-steps in the EM algorithm for estimating the Markov 
transition matrix of a simulated network. The horizontal dashed line at the top of the figure marks 
the marginal log likelihood of the data using the transition matrix derived using the correctly 
ordered paths. The curves in this figure correspond to the average over 10 Monte Carlo simulations. 

increases the marginal log-likelihood, so the EM iterates converge to a local maximum. 

When Monte Carlo methods are used in the E-step monotonicity is no longer guaranteed since 
the M-step solves 

6 k+1 = (A k+1 ,Z k+1 ) = argmaxQ(A,7r;A fc ,7r fe ), 

where Q is defined analogously to Q but with terms ttf^li and f^) replaced by a^X and fj"^ , their 
corresponding importance sampling approximations. Consequently, care must be taken to ensure 
that Q approximates Q well enough so that the EM algorithm is not swamped with error from the 
Monte Carlo estimates. 

To illustrate this issue, consider the following synthetic example. We generate 40 co-occurrence 
observations by taking a random walk on a graph with 140 vertices. Each co-occurrence has between 
4 and 8 vertices. Figure |3f a) plots Q(6 k ;0 k ~ 1 ) for the exact E-step, along with Q(6 ;0 ) and 
Q(0 ; 9 ) for the Monte Carlo EM algorithm using only 10 importance samples per co-occurrence. 

^ ^ ^ ^ -^k A k \ X — h 

Although Q(0 ; ) increases at each iteration, Q(0 ; ) clearly does not and the monotonicity 
property does not hold. This is apparent in Figure|2Ib), where the dash-dot line shows the progress 
of the marginal log-likelihood (our optimization criterion) for the 10 sample Monte Carlo EM 
algorithm. When enough importance samples are used the Monte Carlo EM algorithm performs 
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Figure 3: An example with simulated observations illustrating that the Monte Carlo EM algorithm 
may not result in monotonic increase of the marginal log-likelihood if too few Monte Carlo samples 
are used. The solid line in (a) is Q(6 k+1 ; k ) for exact EM iterations, the dashed line is Q{6 ; 6 ) 

and the dash-dot line is Q{6 ; ) for Monte Carlo EM iterations using only 10 samples. Even 
though Q increases monotonically, Q may not be monotonic for the Monte Carlo EM algorithm. 
Figure (b) depicts the marginal log-likelihood for exact EM iterates and for two versions of the 
Monte Carlo EM. Monte Carlo EM performance closely resembles that of the exact EM algorithm 
when sufficiently many importance samples are used. 

comparably to the exact EM algorithm; see the dashed line in Figure|Hfb) corresponding to a Monte 
Carlo EM algorithm using 1000 importance samples per co-occurrence. All three instances of the 
EM algorithm used in this example start from the same initialization. 

Recently, researchers have considered the question of how many importance samples should be 
used in a Monte Carlo E-step [3,5,11]. The goal is to balance monotonicity and computational 
efficiency by using enough samples to have a good chance at monotonicity while not using excessively 
many samples. Booth et al. [3] argue that if the same number of importance samples is used at each 
EM iteration, then the algorithm will eventually be swamped by Monte Carlo error and will not 
converge. They also suggest requiring that a convergence criterion be satisfied on multiple successive 
iterations since the criterion may be met prematurely due to poor Monte Carlo approximations. 

Caffo et al. [5] propose a method for automatically adapting the number of Monte Carlo samples 
used at each EM iteration. To lighten notation, we drop the superscripts k and k + 1. Let A(0) = 
Q(0; 0')-Q{0'; 0') and A(0) = Q(0; 0')-Q(6'; 9'). Furthermore, let 6 = argmaxg Q(0; 0'), where 
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9' = 6 k was determined at the previous EM iteration. Recall that L importance samples are used to 
calculate Q. The algorithm of Caffo et al. is based on a Central Limit Theorem-like approximation 
in which they show that <</L(A(0) — A(G)) converges in distribution to the standard normal. 
Observe that the monotonicity property (|37|) is equivalent to the condition A(G) > 0. Although 
A (6) cannot be computed without computing the exact sufficient statistics and {f^li }, we 

can compute A(0). Their scheme then amounts to increasing the number of Monte Carlo samples 
until A(0) > e for a user-specified e > 0. Then, applying an asymptotic standard normal tail 
approximation, they obtain a statement of the form Pr (^A(O) — A(0) > e^j < 5(e). Based on this 
statement they claim that monotonicity holds with probability at least 1— 5(e). They further remark 
that if a different e k is chosen at each iteration, so that X^a-Li 5(e k ) < 00 > then, by the Borel-Cantelli 
Lemma, Pr (a(0) - A(G) > e k i.o.J = 0, so there exists a K > such that A(0) - A(0) < e k for 
all k > K with probability 1; i.e., eventually every EM update is monotonic. Of course, in practice, 
the algorithm is terminated after a finite number of iterations, so we may never reach the stage 
where all iterates are monotonic. 

Notice that for the monotonicity condition A(6) > to truly hold in the above framework, the 
events 

{A(0) - A(0) < e} and |a(0) > e} 

must occur simultaneously. Because the probabilistic bound above only addresses one of these 
events we refer to this type of result as guaranteeing an (e, 5)-probably approximately monotonic 
update, or PAM for short. More generally, an (e, <5)-PAM result states that with probability at least 
1 — 5, the update will be e-approximately monotonic; i.e., A(0) — A(0) < e implies A(6) > — e, 
because, by definition, A(0) > 0. 

Rather than resorting to asymptotic approximations to obtain such a result, we can take advan- 
tage of the specific form of Q in our problem to obtain the finite-sample PAM result next presented. 
Recall that independent importance samples are drawn for each co-occurrence observation in the 
Monte Carlo E-step. Denote by L m the number of importance samples used to compute sufficient 
statistics for observation x( m ). Exact E-step computation for this observation requires N m \ opera- 
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tions. Similarly, we should expect that larger observations will require more importance samples for 
two reasons: 1) there are more sufficient statistics associated with this observation (N^ in total), 
and 2) there are more ways to shuffle these observations. 

In the previous section we derived closed form expressions for the importance sample weights, 
Zi = -P[r|x, A, 7r]/i?[r|x, A, n], where P is the target distribution and R is the importance sampling 
distribution. A key assumption was made that P is absolutely continuous with respect to R; that is, 
P[r|x, A, 7r] = for every permutation r with i?[r|x, A, 7r] = 0. We adopt the convention 0/0 = 
so that Zi = for such samples. This guarantees that Zi < oo. The bounds below depend on the 
quality of our importance sample estimators as gauged by 

P[r|xM, A,tt] 

b m = max — — — t— r -. Ida) 

re*jv m ii[r|xM, A, it] V ; 

Because the set ^N m ^ s finite, P and R have finite support, and the maximum is well-defined. If R 
matches the target distribution P well then b m should not be very large. 

There is one other subtlety we will account for in our bounds. Because Q(0; 6') has terms log A%j 
and log7Tj, in practice we typically bound A^j and vrj away from zero to ensure that Q does not go 
to — oo. To this end, we will assume that Aij > 9 m i n and 7Tj > # m j n for some < 6> min < |<S| _1 . The 
upper bound on ^niin ensures it is still possible to satisfy the constraints (fTj) . Generally we choose 
$min very close to zero; at machine precision, for example. 

We have the following finite-sample PAM result for our Monte Carlo EM algorithm. 

Theorem 1. Let e > and 6 > be given and assume there exists 9 m \ n £ (0, |5| _1 ) such that 
A\a > #min and 7T- > 6 m \ n for all i and j. If 

2T 2 Njb 2 m |logfl min | 2 / 2Nl \ 

Lm = ? log {i-(i-sy/T) (39) 

importance samples are used for the mth observation then A(0) — A(6) < e with probability greater 
than 1 — 5. 

The proof of Theorem ^ appears in Appendix [XJ Because A(0) > by definition, the theorem 
guarantees that A(0) > — e with probability greater than 1 — 5. 
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Recall that exact E-step computation requires N m \ operations for the mth observation. The 
bound above stipulates that the number of importance samples required is proportional to log 
and generating one importance sample requires N m operations. Thus, the computational complex- 
ity of a PAM Monte Carlo update only depends on A^logiV^, which clearly demonstrates that 
the computational complexity of the Monte Carlo E-step depends polynomially on the N m in com- 
parison to exponential dependence for the exact E-step. 

To put this result in perspective, observe that the value of L m given by is roughly a factor 
of T away from the value we would expect based on an asymptotic variance calculation. Ignoring 
constants and log terms, for fixed 6 we have 



since independent sets of importance samples are used to calculate sufficient statistics for different 
observations. It is easily shown that the variance of an individual approximate statistic a^v,, or 
r^) decays according to the parametric rate; i.e., Var(aj™l) — \jL m . In total, there are 
sufficient statistics for the mth observation, and they are all potentially correlated since they are 
functions of the same set of importance samples. Then we have 



To drive Var yA(6)j down to a constant level, independent of T and N m , we need L m oc TN^. 

The additional factor of T in our bound is essentially an artifact from the union bound. 

Although the PAM result is pleasing, we would prefer to guarantee monotonicity with high 
probability, not just approximate monotonicity. Let 0* = argmax^ Q(6; 0'). By bounding A(0) — 



high probability. Instead of restricting Aij > 9 m \ n and 7Tj > # m i n , we need to assume the variables 






A(0*) instead of A(6)—A(0) we obtain the following stronger result guaranteeing monotonicity with 
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al, L and r\ J are bounded away from zero. This is stronger than the previous assumption in the 
sense that it implies Ajj and 7Tj are bounded away from zero. 

Theorem 2. Let 5 > be given and assume there exists A > such that ct^l, > A and f^?) > A, 



importance samples are used for the mth observation, then A(0) > with probability at least 1 — 5. 

The proof of Theorem[2]appears in Appendix^ Similar to the PAM bound given in Theorem^J 
the computational complexity required for a probably monotonic update also depends polynomially 
on N m . 

Note that L m depends on A(0*). By definition, A(6*) > at every iteration, and typically 
A(6*) is large at earlier EM iterations and approaches zero as the algorithm converges. This depen- 
dence supports the observation of Booth et al. mentioned earlier, that the number of importance 
samples ought to increase at each iteration. 

The main assumption of Theorem [21 is that the sufficient statistics are bounded away from zero 
at each iteration. We motivate this assumption by observing that if the algorithm is initialized 
with 7r^ > and ■ > for all i,j then, recalling the closed form E-step expressions, the sufficient 
statistics do not vanish in a finite number of EM iterations. 

Note that if we use different 5^ at each EM iteration, chosen such that X/fcLi < °°, then by 
the Borel-Cantelli Lemma one can argue that Pr (A(#) < i.o.) = 0. In other words, eventually 
all EM iterates result in a monotonic increase of the marginal log-likelihood. 

In addition to demonstrating that the Monte Carlo EM algorithm has polynomial computa- 
tional complexity, these bounds give a useful guideline for determining how many importance 
samples should be used. However, because they involve worst-case analysis, the numbers of sam- 
ples dictated by these bounds tend to be on the conservative side. For example, in the Internet 
experiments described in Section |H1 T = 249 and = 17. Theorem |2 suggests that roughly 72 
million importance samples should be used per observation. However, in our experiments we find 
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that the algorithm exhibits reasonable performance on this data set using as few as 2, 000 samples 
per observation. Of course, in practice, we do not have direct access to the parameters b m , A, or 
A(6*), so these bounds cannot be used as explicit guidelines. 

6 Experimental Results 

In this section, we evaluate the performance of our network inference from co-occurrences (NICO) 
algorithm on simulated data and on data gathered from the public Internet. In the results reported 
below, network reconstructions are obtained by first estimating an initial state distribution and 
probability transition matrix via the EM algorithm. Then, we compute the most likely order of 
each observation according to the inferred model and use this ordering to reconstruct a feasible 
network. The EM algorithm cannot be guaranteed to converge to a global maximum (the marginal 
log-likelihood is not concave) and there may even be multiple global maxima. To address this issue, 
we rerun the EM algorithm from multiple random initializations and report the collective results. 

We compare the performance of our algorithm with that of the frequency method (FM), defined 
in [21] and mentioned in Section ^ The FM also reconstructs a network topology by estimating an 
order of the vertices in each observation. This method individually determines each path ordering 
independently by sorting the elements in the path according to a score computed from pair-wise 
co-occurrence frequencies involving the source and destination of the path. It is possible that 
multiple vertices may receive identical FM scores, in which case their sorting would be arbitrary 
(one could exchange elements with identical scores without violating the FM criteria). In fact, we 
observe this phenomenon in many of our experiments. Ties are resolved by choosing a random order 
for elements with identical scores. Multiple restarts are also performed using the FM, yielding a 
collection of feasible solutions. 

The quality of a network reconstruction is determined by a quantity we term the edge symmetric 
difference error. Because the nodes in the network have unique labels, the goal of any reconstruction 
scheme is to determine which vertices are connected by an edge. The edge symmetric difference 
error is defined as the sum of the number of false positives (edges appearing in the reconstructed 
network which do not exist in the true network) and the number of false negatives (edges in the 
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true network not appearing in the reconstructed network). 
6.1 Simulated Networks 

Our synthetic data is obtained as described next. A network is generated according to a random 
geometric graph model: 50 vertices are thrown at random in the unit square, and two vertices are 
connected with an edge if the Euclidean distance between them is less than or equal to y / log(50) /50. 
This threshold guarantees that the graph is connected with high probability. Groups of nodes 
are randomly chosen as sources and destinations, transmission paths are generated between each 
source-destination pair according to either a shortest path or random routing model, and then 
co-occurrence observations are formed from each path. We keep the number of sources fixed at 
5 and vary the number of destinations between 5 and 40, to see how the number of observations 
effects performance. Each experiment is repeated on 100 different topologies, using 10 restarts of 
both NICO and the FM per configuration. Exact E-step calculation is used for observations with 
N m < 12, and causal importance sampling (2000 samples) is used for longer paths. The longest 
observation in our data was obtained by random routing and has N m = 19 (notice that 19! ~ 10 17 ). 

Figure 0] plots edge symmetric difference performance for synthetic data generated using (a) 
shortest path routing and (b) random routing. The edge symmetric difference error is computed 
between the inferred network and the graph obtained from correctly ordered observations. Of the 
10 solutions corresponding to different NICO initializations, we use the one based on parameter 
estimates yielding the highest likelihood score. For this simulation, the most likely NICO solution 
also always resulted in the best edge symmetric difference error. 

The FM does not provide a similar mechanism for ranking different solutions. A possible 
heuristic would be to choose the sparsest solution (with fewest edges). The figures show performance 
for both this heuristic, and clairvoyantly choosing the best (lowest error) solution of the 10. In 
fact, using the sparsest solution does better than just choosing a FM solution at random but not 
as well as using the clairvoyant best. In these simulations, NICO consistently outperform the FM. 

Notice that both algorithms exhibit their worst performance at an intermediate number of 
destinations. When very few destinations are used the measured topology closely resembles a tree, 
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Figure 4: Edge symmetric differences between inferred networks and the network one would obtain 
using co-occurrence measurements arranged in the correct order. Performance is averaged over 100 
different network configurations. For each configuration 10 NICO and FM solutions are obtained 
via different initializations. We then choose the NICO solution yielding the largest likelihood, and 
compare with both the sparsest and clairvoyant best FM solution. 

regardless of the underlying routing mechanism. Relative frequencies of co-occurrence accurately 
reflect the network distance of each internal vertex from the path endpoints. At the other extreme, 
when many destinations are used, there is significant overlap among the co-occurrence observations 
which aids in localizing vertices. In general, the FM seems to be much more sensitive to the amount 
of data available. 

As expected, the FM generally performs better on shortest path data than it does on random 
routes. When routes are generated randomly the corresponding topology is less tree-like and pair- 
wise co-occurrence frequencies do not reflect network distances. Because NICO is not based on a 
particular routing paradigm it performs similarly in both scenarios, possibly even a little better 
when routing is random. 



6.2 Internet Data 

We have also studied the performance of our algorithm on co-occurrence observations gathered from 
the Internet. Using traceroute we have collected data describing roughly 250 router-level paths 
from sources at the University of Wisconsin-Madison, the Instituto Superior Tecnico in Lisbon, and 
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Rice University to 83 servers affiliated with corporations, universities, and governments around 
the world. Our motivation for using this type of data is two-fold. First, traceroute allows us 
to measure the true order of elements in each path so that we have a ground truth to validate 
our results against. Second, and more importantly, the data comes from a real network where, 
presumably, paths are not generated according to a first-order Markov model. This allows us 
to gauge the robustness of the proposed model and to evaluate how well it generalizes to realistic 
scenarios. The ground truth network contains a total of 1105 nodes and 1317 edges, and the longest 
observed path has length 27. 

For this data set we rerun FM and NICO each from 50 random initializations and look at 
performance across all solutions rather than focusing on the maximum likelihood or clairvoyant 
best. The exact E-step is used to compute a for paths of up to 9 hops. For paths longer than 
9 hops, we use the causal importance sampling described in Section 14.11 with 2000 samples per 
observation. 

Minimum, median, and maximum edge symmetric difference errors are shown in FigureEl Both 
algorithms have seemingly high error rates, as there are roughly 1300 links in the true network. 
However keep in mind that both algorithms are attempting to fill in the entries of a roughly 
HOOx 1100 matrix. For 50 networks constructed by choosing a random order for the elements of each 
observation the average edge symmetric difference error was 4300, so both algorithms are indeed 
doing considerably better than random guessing. NICO performance is again noticeably better 
than that of the FM; the NICO average error is better than that of the best FM reconstruction, 
and the worst case NICO reconstruction is on par with the average FM performance. We also note 
that the number of false positives and false negatives in a reconstruction using either scheme tend 
to be roughly equal (each constituting half of the edge symmetric difference error). 

Figure El shows statistics for the number of edges in the reconstructed networks. There is an 
interesting correlation between the number of edges and reconstruction accuracy in this example. 
As seen above, the typical NICO reconstruction is more accurate, in terms of edge errors, than 
a FM reconstruction. NICO also consistently returns a sparser estimate. The median number of 
links in a NICO reconstruction is 1329, whereas the median number of links in a FM reconstruction 
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Figure 5: Edge symmetric difference error comparison of NICO and FM on Internet data. The 
reported values come from 50 random initializations of each algorithm. 
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Figure 6: Number of edges in networks reconstructed using each method. The median number of 
edges per reconstruction is 1329 for NICO and 1426 for FM. The true network has 1317 edges, and 
so it appears that NICO does a better job of capturing the complexity of the true network. 



is 1426. There are 1317 edges in the true network, so in this sense the NICO reconstructions more 
accurately reflect the inherent level of complexity in the true network. 

Marginal log- likelihood values for each of the 50 NICO estimates are depicted in Figure [3 The 
marginal log- likelihood, given by (@J), is the cost function being optimized by the EM algorithm. 
In contrast to the experiments with simulated data reported above, there is no exact correlation 
between higher marginal likelihood values and lower edge symmetric difference error for this exam- 
ple. The topology with the highest likelihood value results in an edge symmetric difference error of 
627. This is better than the clairvoyant best FM error, but only average for NICO. The three rep- 
etitions which returned a topology with the lowest symmetric difference error had the next highest 
likelihood value, as indicated by the three hollow circles in the figure. The dashed line shows the 
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Figure 7: Marginal log likelihood values for different random initializations of NICO, sorted in 
ascending order. The three hollow circles correspond to the solutions which achieve the lowest edge 
symmetric difference error of all NICO trials. The red line shows the marginal log likelihood value 
computed using the true path orders to estimate a Markov transition matrix. Most NICO solutions 
have higher marginal log-likelihood than the true topology, suggesting that our generative model 
does not accurately describe Internet topology data. 

likelihood value based on a transition matrix estimated using the true path orders as measured by 
traceroute. Notice that the majority of the NICO solutions have a higher marginal likelihood 
than the true topology. This suggests that our generative model may not be the best match for 
Internet topology data. Still the overall performance of our algorithm is encouraging. 

7 Discussion and Ongoing Research 

This paper presents a novel approach to network inference from co-occurrence observations. A co- 
occurrence observation reflects which vertices are activated by a particular transmission through the 
network, but not the order in which they are activated. We model transmission paths as a random 
walks on the underlying graph structure. Co-occurrence observations are modelled as i.i.d. samples 
of the random walk subjected to a random permutation which accounts for the lack of observed path 
order. Treating the random permutations as latent variables we derive an expectation-maximization 
(EM) algorithm for efficiently computing maximum likelihood or maximum a posteriori estimates 
of the random walk parameters (initial state distribution and transition matrix). 

The complexity of the EM algorithm is dominated by the E-step calculation which is exponential 
in the length of the longest transmission path. In order to handle large networks, we describe fast 
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approximation methods based on importance sampling and Monte Carlo techniques. We derive 
concentration-style bounds on the accuracy of the Monte Carlo approximation. These bounds 
prescribe how many importance samples must be used to ensure a monotonic increase in the log- 
likelihood, thereby guaranteeing convergence of the algorithm with high probability. The resulting 
Monte Carlo EM computational complexity only depends polynomially on the length of the longest 
path. 

To obtain a network reconstruction, we determine the most likely order for each co-occurrence 
observation according to the Markov chain parameter estimates, and then insert edges in the graph 
based on these ordered transmission paths. This procedure always produces a feasible reconstruc- 
tion. The parameter estimates produced by the EM algorithm may be useful for other tasks such as 
guiding an expert to alternative reconstructions by assigning likelihoods to different permutations, 
or predicting unobserved paths through the network as in [12]. One could also analyze properties 
of an ensemble of solutions obtained by running the EM algorithm from different initializations, 
and then posit a new set of experiments to be conducted based on this analysis. 

The transition matrix parameter A^j can be interpreted as estimates of the probability that 
a transmission will be passed from vertex i to j, conditioned on the path reaching i; that is, 
Aij = P[Zk+i = j\Z k = i]. In particular, they are not estimates of the probability of a link 
existing from i to j. Since A is a stochastic matrix, each row must sum to 1, and so if vertex i 
is connected to many other nodes then the unit mass is being spread over more entries. We can 
obtain joint probabilities, P[Z k = i, Z k+ \ = j], via Bayes theorem, 

DI7 ., v -i P[Z k = i, Z k+ i = j] 
P[Z k+1 =j\Z k = i ] = ^— , 

where P[Z k = i] is the stationary distribution of the chain (not necessarily equal to the initial 
state distribution). These joint probabilities (appropriately scaled versions of the transition matrix 
entries) more accurately reflect the likelihood of there being an edge from % to j, based on our 
estimates. 

Our future work involves extending and generalizing both algorithmic and theoretical aspects 
of this work. In our experiments we found that our current model leads to reasonable Internet 
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reconstructions, but we feel there is room for improvement. For example, the structure of Internet 
paths may depend strongly on the destination of the traffic. We are currently investigating models 
based on "mixtures of random walks" to account for this added level of dependency. 

Co-occurrence observations naturally arise from transmission paths in communication network 
applications and, to a degree, in biological, social, and brain networks as well. However the physical 
mechanisms driving interactions in the latter three applications may also correspond to more general 
connected subgraph structures such as trees or directed acyclic graphs. Extending our methods 
in this fashion is easily accomplished in theory, however the computational complexity may be 
significantly increased when more general structures are considered. 

In this paper we have also restricted our attention to noise- free observations. We are also inter- 
ested in extending our algorithm to handle the case where observations reflect a soft probability that 
a given vertex occurred in the path rather than hard, "active" or "not active", binary observations. 
This extension is relevant in many applications including the inference of signal transduction net- 
works (in systems biology) where co-occurrence observations are themselves the result of inference 
procedures run on experimental data. 

A Proof of Theorem [1] 

There are two main steps in the proof of Theorem ^ First, we derive a concentration inequality for 
the importance sample approximations, OLj^ 1 ),, and r^j) . Then we use the inequality to construct a 
bound for A(0) - A(0). 

Recall the expressions (1221) and (|23|) for importance sample approximations calculated in the 
Monte Carlo E-step. Both are of the general form ju^ = J, \ , where Z : ^> n — ► [0, b] and 

X : — ► {0, 1}, and they are approximating fi = X} rg # X(r)P[r\x., A, it]. Note that E[/2l] ^ fj,, 
so standard concentration results such as Hoeffding's inequality or McDiarmid's bounded-differences 
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inequality do not directly apply; e.g., consider the case L = 1: 



E 



Z(ri) 



= J] X(r)P[r|x,A,7r] (41) 
+ X(r)P[r|x,A,7r]. (42) 



We can, however, show that fii yields an asymptotically consistent estimate of fi. Observe that 

E ^ = E wra*^^ (43> 

= 1, (44) 



since P is a probability distribution on ^ at, and 



E[Z(r<)X(r*)] = S4^T X(r)jR[r|X ' A ' 7r] (45) 

= ^ X(r)P[r|x,A,7r] (46) 

re*jv 

= M- (47) 



It follows from the strong law of large numbers that JIl — ► \i as L — > oo. 

The following finite-sample concentration inequality demonstrates that the approximation error, 
fi-N — M, decays exponentially in the number of importance samples, L. 

Proposition 1. Let {(Xi,Zi)} be a sequence of independent and identically distributed random 
variables with X; L G {0,1} and Z; L £ [0,6]. Assume that E[Zj] = 1 and K[ZiXi\ = n, and set 



V>L = ^Al 1 ■ Then with probability greater than 1 — 5, 



/26 2 logf , N 

vl-v < y L sg . (48) 

Proof. From the definitions of Zj and Xj, ZjAj G [0,6]. Applying Hoeffding's inequality [10] yields 
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that for any t > 0, 



and for any t > 0, 



Pr ^2 Z i X i ~ L V> Lt\ < 



-2Lt 2 /b 2 



(49) 



Pr Z i~ L < - L tj 



< e 



-2Lt 2 /b 2 



(50) 



Define the event, E t = {X^iLi z i^i — Lfi > Lij (J |X^=i %i — L < — Lt\. By the union bound, 
Pr(-Bt) < 2e~ 2Lt2 / b2 for any t > 0. The complement of E t implies that for t G (0, 1), 



Mi-/" 



+ 



< 



+ 



L(l-t) L(l-t) 

*(! + //) 
1-i 



(51) 
(52) 
(53) 



It follows that - M > C £ f , and so Pr (j2 L -fi> < Pr(£ 4 ) < 2e~ 2Lt ' '/ b \ Since 

V'L < 1 3 if + /x > 1 then Pr (jii — fi > J = ^> anc ^ the proposition holds trivially. Thus, 

without loss of generality we consider the case + /i < 1, or equivalently, t < (1 — /i)/2. 

This restriction on i implies < 2t, and so we have Pr — fj, < 2t) > 1 — 2e~ 2Lt2 l b2 . Set 



2e 2i * 2 / fe2 to obtain the desired result. 



□ 



We apply Proposition ^ to the Monte Carlo approximations {3+7!//} and {f^)} as follows. 
Recall that the Monte Carlo weights are bounded according to z,- L G [0, b m ], with b m as defined in 
(BED- Define 



u 

\ *Vt" 



a,, 



(m) _ (m) 



t'.t 



" a t',t" — 



\ 



J 



Nrr 



L, 



This is a union over 2(^ m ) + iV m = events, each of which holds with probability at most 5' 



39 



according to Proposition ^ By the union bound it follows that Pr(i3^ L ) < N^S'. Next, define 



^K' T 



E (4> - 47) + E ( f l7 - -17) * V > ' (54) 
f,t"=i *'=i » 



and observe that q? Lm C £™ jLm , therefore Pr(C7J? Lm ) < Pr(£^ L J < JV* Let «J" = N^S' and 
let L > be a value to be determined later. For each m = 1, . . . , T, set 



2LJV^log> 
— : 1 j 



so that 



9 2b 2 m log| „ /2fc2 log^P /log 4 , N 



r m \ t \ t 



Then with probability greater than 1 — 5", 



N m N„ 



f,t"=i t'=i 



Recall that are indicator variables satisfying £^ij'=i 4^14 

n ) = 1 and Ell 4"? = L Mul " 
tiplying each term in (|57|) by the appropriate sum of indicators, rearranging terms, and recalling 
that importance sample estimates for different observations are statistically independent, we have 
that with probability greater than (1 — S") T , 



N m \S\ N m \S\ I, i 

'Am) _(m)\ Jm) . A y °^W 



n E E (4$ - 4$) 4542 + E E - 4? < V ^ ^ (58) 

m=l lt',i"=lij'=l t'=l i=l ' 



which implies that with probability greater than (1 — 8") T 

T N m \S\ T Nm \S 

E E E(4$-4?£)4342 + ££Z 

m=l t',t"=l i,j=l m=l t'=l i=l 



E E E (4$ - 4$) 4342 + £ £ £ ft? - *$) 4? < («» 



Finally, set 1 — 5 = (1 — 5") T and multiply through by | log # m i n | > 0. Then with probability greater 
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than 1 — 5, 



T N m \S\ T N m \S\ 

E E E ( S S» - 4$) i ^ w + EEE( f S ) - r V) 4? i 

m=l i',t"=l i j'=l m=l t'=l i=l 



/-log(l- (l-J) 1 /^) 

< T | log 0^1 W ^ £ '- '-. (60) 

To complete the proof, observe that 

T \S\ N m 

m - A(d) = E E E ( S S - 4t) 454? ^ - lQ g <;) 

m=l ij=l 

T \S\ N m 

+ E E E - *$) 4? - o . (6i) 

m=l i=l ('=1 

By assumption, # m i n < Ai^^A^- < 1 for each (i,j) G 5 2 . It follows that 

log Aij - log A[ j < - log 6> min = | log (9 min | . (62) 

Similarly, log7?j — log7r^ < | log ^ m j n | for each i £ S. Apply these bounds in (|61|) to find that the 
right hand side of Q61|) is no greater than the left hand side of H6UJI . Set 



6 = r|iogfl min h/ log 1 -( 1 - ? ) 1/r . (63 ) 



Then A(6) — A(0) < e with probability greater than 1 — 5. Solve for L in ()63|) and plug the resulting 
value back into (|55[) with 5" = 1 — (1 — d) 1 ^ to obtain the desired result. 



B Proof of Theorem [2] 

To prove Theorem |2] we will show that A(0) > (1 — e) A(0*) with high probability, but first we 
need two preliminary results. We begin by deriving concentration inequalities for the Monte Carlo 
sufficient statistics. Then we use these bounds to show that the corresponding M-step parameter 
estimates, Aij and 7Tj concentrate about their asymptotic means, A* • and ir*. From there we 
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construct the desired bound for A(0), which implies the theorem since A(0*) > by definition. 

The proof of Theorem Q makes use of additive concentration inequalities, bounding the prob- 
ability of deviations of the form jEt£, — \i > t. In this proof we make use of relative concentration 
inequalities to ensure that > (1 + e)fi with high probability. 

Proposition 2. Let {(Xi,Zi)} be a sequence of independent and identically distributed random 
variables with Xi E {0,1} and Zi E [0,6]. Assume that E[Zj] = 1 and ~E[ZiXi] = fi, and set 
V-L= Jtl — ! — L ; as before. Then with probability greater than 1 — S, 



/2761og§ , 



and u>ii/i probability greater than 1 — 5, 



/2761og§ , 

Proof. Since X$ E {0, 1} and Z% E [0, b], Z{X{ E [0, b] also. Applying the relative form of Hoeffding's 
inequality (see, e.g., Theorem 2.3 in [17]), we have that for any j3 > 0, 



Pr(gZ*>(l + 0L M ) < exp{^ /3) 



(64) 



If (3 < 1 then 2(1 + /3/3) < 3, and so for /? E (0, 1] 



Pr [Y^ZiXi> (l + P)Lfij < expj^g^. (05) 



which suffices for our application. Also, for any 7 > 0, 



Pr I V Z t < (1 - i)L \ < exp <J — ^- }> . (66) 
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Suppose the events 



j^Z.Xi < (l + /3)L/i| and |^Z< > (1 - 7 )l| 



(67) 



occur simultaneously. Then for < 7 < 1, 



Since we will apply the union bound, we balance the exponential rates in (|65|) and (|66|) by setting 
7 = < 1. Solving 

1 + 3 1+ " -l + e (69) 



for /? in terms of e leads to 



P = =. (70) 



1 + -v/f/i + eJ |/i 



In order to ensure that /3 < 1 we restrict 



e < -^-1 (71) 



1- Vi^ 



(72) 



Note that the right hand side of the expression above is at least 1 for all fi G [0, 1] . Apply the union 
bound with the complements of the events in (|67|) using (|7U|) in the exponent, and observe that 

1 + Y^ + ey^ ^ 3 for a11 A» e [° 5 1] and 6 G (0. !) to find that Pr {V>L < (1 + e)/i) < 2e~ L ^ 2 / 27b2 . 
Set (5 = 2e~ Lfie2 / 27b2 to obtain the first claim. The proof of the second claim follows a similar 
sequence of steps. See [20] for the full details. □ 

Application of the union bound yields the following. 
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Corollary 1. With probability greater than 1 — 5, 



/27olog4 \ _ / /2761ogf . 

^V^FJ"^ < (W^r 1 !"- (73) 

Next we apply Corollary ^ to the Monte Carlo approximations, {r^}} and {5j+7!>,} towards 
showing that Aij and 7?j do not deviate too greatly from A* ■ and 7r*. Recall the exact M-step 



expressions for n* and A* • given by (|18|). The corresponding expressions for 7Tj and Ajj are found 

md a|,™l with r^j) and By apprc 

and denominators of A^j and 7Tj we obtain the following result. 



by replacing each and or™},, with and aj,™L By appropriately bounding the numerators 



Proposition 3. Let L > and 5 > be given. Assume that there exists A > such that fj"?) > A 
and a+ni > A /or all m = 1, . . . , T and i', i" = 1, . . . , N m . If at least L m > 27b ™ L importance 
samples are used, then with probability at least 1 — (X)m=i ^m)^' 

%> | 1 V ,JL. 1 t; > I . (71) 





Proof. First recall that there are 2(^ m ) + iV m = iV^ sufficient statistics associated with the rath 
observation: one a^„ for each of the 2(^T l ) possible transitions and one rj™, for each possible 
initial state. Then, in total there are Ylm=i sufficient statistics to calculate in the E-step. 
Applying the union bound in conjunction with (jZHj) we have that with probability greater than 

i-(ELiO*> 



_(m) , m mOc\,/n\og- s ^ (m) _ (m) / 21b m a\, log j 



for all m = 1, . . . , T and t', t" = 1, . . . , iV m , and 



f m _ < ^ < r « + (76) 

for all m = 1, . . . , T and t! = 1, . . . , N m . Based on the assumption that a^ 1 ),, > A and fj^Jr > A, 
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taking L m > 27b m L/X guarantees that 



. , 27b m L nb m L 

Lm > max max — 7-^—; max 



t',t"=l,...,N m o (m) f=l,.»,iVm M m ) 

Then with probability greater than 1 — (X^m=i ^m)^> 



for all m = 1, . . . , T and i', £" = !,..., iV m , and 



for all m = 1, . . . , T and t' = 1, . . . , iV m . 

Equation (|78|) implies that for each £ S 12 , 



and for each i £ S, 



\S\ T (m) / / 1 „„_4\ l s l T iV„ 



/ i 4 1 P ^ JV m 



k=lm=lt',t" \ V / lc=lm=l(',i"=l 



Taking the ratio of these two expressions yields the desired result for Aij and A* 
Similarly, ()79|) implies that for each i, 



m=l t'=l V / "i=l f 
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and for each i, 



T Nn 



m=l t'=l V / m=1 *' 

Taking the ratio of these two expressions yields the desired result for tt $ and ir* . □ 

The remainder of the proof of Theorem |2] is now fairly straightforward. Let 5 > be the 
value given in the statement of Theorem |2j Monotonicity of the logarithm in conjunction with 
Proposition El implies that with probability greater than 1 — 5, for all i,j £ S, 



1 _ , / log4E^i^-log^ 

log A J > log 4*- + log I V L =\, (82) 

1 1 , / l°g4E^l^-l°g<? 





, N 2 

-1 1 v m 


-log (5 


L 


'log^EL 


, AT 2 
-1 m 


-log <5 


L 



logTTi > log < + log | *~ v ; , = == | • (83) 

Multiply through by either l£>=i 4?Mv4J > or ELl £&l 4?tf'4!? > as 

appropriate, and sum over i and j to obtain 

Q{d;G') > Q{0*-G') 



\S\ T N m \S\ T N m \ I JIos^TnI 

.£,.7=1 m=l t',t"=l i=lm=li'=l / 



(84) 

By the definitions of f[™, , and given in Section|3]we have X^=i ^i™ = 1' Et^T"=i ^t™" = 

S| _M (m) 
i,j=l x t",i x t>,j 



Nm-1, EKLi 414? = !> and Effli 4"? = 1- It follows that 



|S| T N m \S\ T N m T 

E E E 4^4542 + E E E * } 4? = E N - ( 85 ) 

i,j=l m=l t',t"=l i=lm=lt'=l m=l 
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Subtract Q(0'; 0') from both sides of (|84|) to obtain that with probability greater than 1 — 5, 



A(0) > A(0*) + [J2 N m log 



vm=l 



1 + 



L 

log^ESLl^^g* 



(86) 



Let e > be the value given in the statement of Theorem |2] and set 



N ™ ^g 



\m=l 



1 + 



L 



-eA(6*) 



(87) 



Solving for L yields 



(l 



L 



+ exp 



I E™ =1 / 



log 



4V T . AT 2 



Recall the well known inequality: u > log(l + u) for u > 0. Putting u = J^t^ 1 > leads to 



eA(6>*) 

Em=l ^ 



> 1 + 



eA(6>*) 



(89) 



Take the exponential, which is a monotonic transformation, and then invert the resulting expression 
to obtain 



cxp 



-eA(6>*) 



< 1 + 



eA(0*) 



(90) 



It follows that 



1 + exp 



r -*a(0-) i 



1 — exp 



< 



2Em=l^m + 6A(fl* 

eA(0*) 



(91) 



Using this last result in ()88|) , together with the choice of L m from Proposition |31 we find that if we 
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use 



27b m ( 2El=iN m + eA(0*) \ * ( 4ELi^ rq9 x 

Lm " — [ 7K(en ) lo8 { s J (92) 

importance samples for the mth observation in the Monte Carlo E-step, then A(0) > (1 — e)A(0*) 
with probability greater than 1 — 5. Since A(0*) > by definition we may take e = 1. Then 
A(0 ) > with probability greater than 1 — 5. 
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